Part 1

1. Import and warehouse data:

Import all the given datasets and explore their shape and size.

Merge all the datasets into one and explore the final shape and size.

Export the final dataset and store it on the local machine in .csv, .xlsx and .json formats for future use.

Import the data from the above steps into Python.
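
The steps above can be sketched with pandas as follows; the frames here are tiny in-memory stand-ins for the actual files, and all file names are placeholders:

```python
import pandas as pd

# Hypothetical stand-ins for the given datasets -- substitute the real files.
df1 = pd.DataFrame({"car_id": [1, 2, 3], "mpg": [18.0, 31.0, 22.0]})
df2 = pd.DataFrame({"car_id": [1, 2, 3], "hp": [130, 95, 110]})
print(df1.shape, df2.shape)          # explore shape of each piece

# Merge into one final dataset and check its shape.
merged = df1.merge(df2, on="car_id", how="inner")
print(merged.shape)

# Export in the three requested formats.
merged.to_csv("final.csv", index=False)
merged.to_json("final.json", orient="records")
# merged.to_excel("final.xlsx", index=False)  # requires openpyxl

# Re-import for the following steps.
df = pd.read_csv("final.csv")
print(df.shape)
```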

2. Data cleansing:

Missing/incorrect value treatment

Except for HP, the other features do not have any missing values; we will look for the feature that correlates best with HP.

Drop attribute(s) if required, using relevant functional knowledge.

Perform any other corrections/treatment on the data:

1. One-hot encode the "yr" column - this dataframe will be "dff"

2. Label encode "yr" - this dataframe will be "df"
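
A minimal sketch of the two encodings, on a toy frame standing in for the cars data (the values are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the cars data; "yr" is the model-year column.
df = pd.DataFrame({"yr": [70, 71, 70, 82], "mpg": [18.0, 19.0, 17.5, 31.0]})

# 1. One-hot encode "yr" -> dff (one indicator column per year).
dff = pd.get_dummies(df, columns=["yr"], prefix="yr")

# 2. Label encode "yr" in place -> df (a single integer column).
df["yr"] = LabelEncoder().fit_transform(df["yr"])

print(dff.columns.tolist())
print(df["yr"].tolist())
```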

3. Data analysis & visualisation:

Perform detailed statistical analysis on the data.

We can confirm that year plays a role in the increase or decrease of MPG.

We can confirm that origin plays a role in the increase or decrease of MPG.

We can also confirm that cylinders play a role in the increase or decrease of MPG.
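
One simple way to check these claims is a group-wise mean of mpg; here on a small illustrative sample in the spirit of the auto-mpg data (not the actual values):

```python
import pandas as pd

# Small illustrative sample in the spirit of the auto-mpg data.
cars = pd.DataFrame({
    "yr":        [70, 70, 76, 76, 80, 80],
    "origin":    [1, 1, 2, 2, 3, 3],
    "cylinders": [8, 8, 6, 4, 4, 4],
    "mpg":       [15.0, 18.0, 22.0, 24.0, 30.0, 34.0],
})

# Mean mpg per group shows whether each feature shifts the mpg level.
print(cars.groupby("yr")["mpg"].mean())
print(cars.groupby("origin")["mpg"].mean())
print(cars.groupby("cylinders")["mpg"].mean())
```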

Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.

Origin: As the origin value increases, the mpg also increases. For example, if country A produces good vehicles with high mpg, that country's code will correspond to the highest mpg values.

Acceleration: The higher the acceleration, the better the mpg, as per the data. Considering that the acceleration values only go up to 24, acceleration around 24 seems reasonable: you cannot drive too slowly and expect higher mileage; you need to maintain a standard speed to achieve good mileage.

Cylinders: The more cylinders, the higher the power output of the engine; the higher the power output, the lower the mpg.

Displacement: The higher the engine displacement, the higher the engine power; the more power, the lower the mpg.

Horsepower: Horsepower describes the amount of power the engine can produce; the higher the horsepower, the lower the mpg.

Weight: Weight is the major factor affecting mpg; the higher the weight, the lower the mpg.

Origin

Acceleration

Cylinders

HP

The Bugatti Veyron has 1001 hp but very low mileage.

Displacement

Weight

Year

Machine learning:

Use K Means and Hierarchical clustering to find out the optimal number of clusters in the data.

We will use a pair plot to find the approximate number of clusters.

Scaling the data before clustering.

Using K-Means on the df dataframe, which has "yr" label encoded.

Gives a good elbow.

Using K-Means on the dff dataframe, which has "yr" one-hot encoded.
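
The elbow check can be sketched as below; synthetic blobs stand in for the scaled car features, so the actual elbow position is an assumption of this example:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three synthetic blobs standing in for the scaled car features.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])
X = StandardScaler().fit_transform(X)

# Inertia for k = 1..8; the "elbow" is where the curve stops dropping sharply.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)]
print([round(i, 1) for i in inertias])
```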

Hierarchical clustering
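
A minimal hierarchical-clustering sketch with scipy, on two synthetic groups standing in for the car data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two well-separated synthetic groups.
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(5, 0.2, (20, 2))])

# Ward linkage builds the merge tree; cutting it yields flat cluster labels.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(sorted(set(labels)))
```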

Share your insights about the difference in using these two methods

1. The two methods differ slightly: K-Means clustering picks centroids and forms clusters by comparing every point against them, and it requires the value of k in advance.

2. In hierarchical clustering we form clusters based on a hierarchy: each point is compared with every other point, nearby points are merged, and this continues until all points are clustered. At the end we are free to cut the tree at any number of clusters.

3. In general, hierarchical clustering is time-consuming and hence not usually preferred; K-Means is quicker, so we pick K-Means clustering.

4. Hierarchical clustering is also more informative than K-Means, which makes it easier to identify the number of clusters; the same is not true of K-Means.

5. Note: K-Means starts from random centroids, so it always requires multiple runs to make sure all the points end up in the right clusters.

Answer the below questions based on the outcomes of the ML-based methods.

Mention how many optimal clusters are present in the data and what could be the possible reason behind it.

1. As per the data, I personally feel there are 3 clusters, as we see distinct groups across all features at k = 3.

2. The k = 3 result fits the years field, which spans 70-82, almost a decade of car manufacturing. It usually takes a decade to come up with a truly efficient engine, and the data says the same: cars manufactured from 70-73 had an average mpg of 18, cars manufactured after 79 had an mpg of 31, and the middle years of 74-79 had a mediocre mileage of 22.

3. So it could be that from 70-73 the manufacturing sector was still working on engines that were not yet efficient; from 74-79 engines got slightly better with improvements in technology; and post-79, almost a decade on from 70, there was a great improvement in engine quality, leading to better mpg.

4. We prefer the label-encoded years dataset, i.e. df, but we will still explain the results on both df and dff.

Use a linear regression model on each cluster separately and print the coefficients of the models individually.

Focusing on the df dataframe, which has "yr" label encoded.
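
The per-cluster fitting step can be sketched as follows; the features, target and cluster count are synthetic stand-ins for the encoded car data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
# Synthetic features and target standing in for the encoded car data.
X = rng.normal(size=(90, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=90)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Fit one linear model per cluster and print its coefficients.
for c in range(3):
    mask = labels == c
    model = LinearRegression().fit(X[mask], y[mask])
    print(f"cluster {c}: coef={model.coef_.round(2)} intercept={model.intercept_:.2f}")
```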

How using different models for different clusters will be helpful in this case and how it will be different than using one single model without clustering? Mention how it impacts performance and prediction.

Linear Regression Impact

Lasso Regression - both Linear regression and Lasso regression look the same.

In this case, using a linear model on the complete dataset works best, as the Predicted vs Actual plot looks better on the complete dataset. This could also be because of the low number of records.

The full dataset contains 398 rows, but when clustered the data splits into roughly 133 rows per cluster, which is too few records to train on and predict from.

The clustered version shows a very poor Predicted vs Actual plot; some clusters look like clouds, though we can still see a bit of linearity in the data.

Also, the adjusted R-squared scores:

Complete dataset: 83.7%

Cluster 0: 40.99%, Cluster 1: 38.22%, Cluster 2: 58.75%
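
These scores can be sanity-checked with the standard adjusted-R² formula; the n = 398 row count comes from the text, while p = 7 predictors and the raw R² inputs are assumptions of this sketch:

```python
# Adjusted R^2 penalises R^2 for the number of predictors p given n samples:
#   R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With 398 rows and an assumed p = 7 predictors, a raw R^2 of 0.84
# lands very close to the 83.7% reported for the complete dataset.
print(round(adjusted_r2(0.84, n=398, p=7), 4))

# On a ~133-row cluster, the same formula penalises more heavily.
print(round(adjusted_r2(0.44, n=133, p=7), 4))
```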

On the whole, if the dataset contained more records, clustering would split the Gaussians of the different clusters apart. That would in turn reduce the skewness and the outliers, which are the main issues for linear regression. In this case, however, the number of records is so low that although clustering does separate the Gaussians, there is not enough data left to train a good model.

Linear regression's main issue is outliers: it assumes the data is linear, and outliers deflect the fit and make the prediction worse.

If we can cluster the data into, say, 3 ideal clusters, it means 3 Gaussian distributions were merged, and merging different Gaussians causes skewness and outliers. Once we split them using clustering, the skewness and outliers are drastically reduced, and a linear model can then be applied to each of the 3 clusters to get a good prediction (provided each cluster contains enough records to train on).

Detailed suggestions for improvements on the quality, quantity, variety, velocity, veracity etc. of the data points collected by the company, to enable better data analysis in future.

1. In this case we have too little data to split into clusters and then apply linear regression. Theoretically, if the data is a mixture of multiple Gaussians, it is better to cluster it first and then apply linear regression. If the quantity of data is high but its quality is bad or unrealistic, the model will not be trained well enough to make the right predictions. On the other hand, if the quality is good (complete, realistic data) but the quantity is low, the model will not have enough information to cover all scenarios; that is the situation we face here.

So, in general, it is always good to maintain a balance between the quality and quantity of the data to build a good model. In this case we need more data so that we can cluster it and apply these techniques.

2. Data can come in different formats: structured and unstructured. It can also be text, images, audio, video, or file formats like Excel and PDF. We must have a mechanism to identify these formats and convert them into structured data so that we can perform analysis and build better models in future.

3. With data coming in from different sources and locations, you can expect a lot of data to arrive at high velocity. It is good to apply big-data technology, using distributed systems instead of a local system. A local system might not handle a huge amount of incoming data or process it at high speed, since it depends on one machine, whereas a distributed system connects multiple worker nodes to a master node. Multiple machines then work together to handle big data, typically using MapReduce on a framework such as Hadoop.

4. Even with those issues rectified, we still need to identify noisy data. Some incoming records may contain abnormalities; these need to be identified and processed or filtered out. Blank records can simply be ignored (removed), or we can go back to the customer for the right data. If skewness is found, techniques like a log transform or normalisation can correct it. If there are a few outliers situated close to Q1 or Q3, we can cap them at Q1 and Q3 respectively; but if there are many outliers, replacing them this way would create multi-modal distributions. Missing values can be imputed using ML algorithms, as we did in this project, or, if they are few compared to the total, dropped or imputed with the median. However, we cannot impute a large number of missing values with the median, as that sharpens the Gaussian curve and reduces the variation in the data.

5. Finally, a better portal with a good database needs to be built so that all the details are gathered and stored neatly. For example, a database that collects details properly and can be queried later.
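
The Q1/Q3 outlier capping mentioned above can be implemented as a small sketch (the series values are made up to show one obvious outlier):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap the few points outside the whiskers at the nearest bound.
capped = s.clip(lower=low, upper=high)
print(capped.tolist())
```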

Part 2

Design a synthetic data generation model which can impute values for [Attribute: Quality] wherever they are empty because the company missed recording the data.

1. Logistic Regression

2. Random Forest Classifier

3. K-Means clustering

In general we can use any model to impute the missing values; since the missing values are discrete, a classification model is required.

We can use Logistic Regression, SVM, KNN, Naive Bayes, Decision Tree classifiers, Random Forest classifiers or even K-Means clustering.

For example, we have used Logistic Regression, a Random Forest classifier and K-Means clustering to impute the missing values.

Pickling all the 3 models and loading them back
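
The pickling round-trip can be sketched as below; a synthetic classification set stands in for the Quality-imputation data:

```python
import pickle
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the Quality-imputation training data.
X, y = make_classification(n_samples=60, n_features=4, random_state=0)
models = {
    "logreg": LogisticRegression(max_iter=1000).fit(X, y),
    "rfc":    RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y),
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0).fit(X),
}

# Dump each fitted model to disk, then load one back and verify it.
for name, model in models.items():
    with open(f"{name}.pkl", "wb") as f:
        pickle.dump(model, f)

with open("logreg.pkl", "rb") as f:
    restored = pickle.load(f)
print((restored.predict(X) == models["logreg"].predict(X)).all())
```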

We can use any data generation model to impute the missing values; since this is a classification problem, any classification technique can be used.

Since we are focusing on K-Means clustering in this project, I have taken K-Means clustering to impute the missing values, and I have also tested it against other models such as Logistic Regression and Random Forest classifiers.

Part 3

1. Data: Import, clean and pre-process the data

EDA and visualisation: Create a detailed performance report using univariate, bi-variate and multivariate EDA techniques. Find out all possible hidden patterns by using all possible methods.

Filling Circularity

We have filled Circularity and scaled_radius_of_gyration

We have filled radius_ratio, elongatedness and scaled_variance

We have filled skewness_about column

scaled_radius_of_gyration.1 column is filled

We have also filled distance_circularity

We have filled pr.axis_rectangularity column

All the missing values have been filled

We haven't dropped any rows to handle missing values, so no data was lost.

Design and train a best-fit SVM classifier using all the data attributes.

Balancing the classes using oversampling and the SMOTE technique.

The oversampled version gives us a good model with a good F1 score.
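
Random oversampling can be sketched with plain pandas as below (SMOTE itself, e.g. from the imblearn package, instead synthesises new minority points by interpolating between neighbours; this example shows only the simpler resampling idea on a made-up target):

```python
import pandas as pd

# Imbalanced toy target: class 0 dominates class 1.
data = pd.DataFrame({"x": range(10), "y": [0] * 8 + [1] * 2})

# Random oversampling: resample the minority class with replacement
# until it matches the majority count.
counts = data["y"].value_counts()
minority = counts.idxmin()
extra = data[data["y"] == minority].sample(
    counts.max() - counts.min(), replace=True, random_state=0
)
balanced = pd.concat([data, extra], ignore_index=True)
print(balanced["y"].value_counts().to_dict())
```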

Dimensional reduction: perform dimensional reduction on the data.

1. Scale the data using the Standard scaler (Z-score).

Classifier: Design and train a best-fit SVM classifier using the dimensionally reduced attributes.

With Unbalanced Dataset

Applying Oversampling

Applying SMOTE Tomek

Conclusion: Showcase key pointers on how dimensionality reduction helped in this case.

Usually, dimensionality reduction helps by reducing the computational time and multicollinearity issues, avoiding the curse of dimensionality, and reducing overfitting.

But here we notice that dimensionality reduction does not add much if we keep the number of PCA components at 3.

With over 18 input columns, using n_components = 3 gives a bad score in both training and testing. If we increase n_components to 9 or 12, we get better results.

I would prefer to take the complete dataset, balance the classes using the SMOTE Tomek technique, apply PCA to reduce the dimensions from 18 to 9, and use a Support Vector Classifier tuned with GridSearchCV and stratified CV to get the best output.
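
The scale + PCA + SVC + GridSearchCV pipeline can be sketched as below; the iris dataset stands in for the 18-column vehicle data, so the component grid here is illustrative, not the actual 9-of-18 choice:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in for the 18-column vehicle data

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("svc", SVC()),
])

# Search over the number of components and C, with stratified folds.
grid = GridSearchCV(
    pipe,
    {"pca__n_components": [2, 3], "svc__C": [1, 10]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```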

Although dimensionality reduction might look most appropriate when 18 columns are reduced to 3 or 4, here it works well at 9 dimensions, half the size of the original dataset.

When we didn't apply dimensionality reduction we got:

Without PCA

With oversampled data: F1 scores of 89% for class 0, 83% for class 1 and 90% for class 2 were achieved --> best of the non-PCA models.

With SMOTE data: F1 scores of 88% for class 0, 81% for class 1 and 87% for class 2 were achieved.

With PCA - we got the best output on the SMOTE-balanced data:

F1 scores of 95% for class 0, 89% for class 1 and 91% for class 2.

So we get the best model overall when PCA is applied.

n = 9 is the smallest dimensionality that also gives a better result.

Part 4 - Sports Management

The goal is to build a data-driven batsman ranking model for the sports management company to support business decisions.

1. The null/missing values appear as blanks everywhere, so once identified as blanks we can drop them all.

1. Univariate, bivariate and multivariate analysis

What we see is the average runs scored and the corresponding total runs. This cannot be interpreted clearly unless the total number of matches played is available.

A player could have a high average (50) but might have played only a few matches (5), which leads to a lower total (250),

whereas a player with a comparatively low average (45) who played more matches (15) will definitely have a higher total (675).

So without a number-of-matches column we cannot deduce the better player; the column can be generated from Average runs and Runs:

No. of Matches = Total Runs / Average Runs

This won't be the exact number of matches, but it is a good approximation.
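
The derived column is a one-line computation; the numbers below reuse the example above:

```python
# Illustrative values from the example above.
total_runs = 675
average_runs = 45

matches = total_runs / average_runs  # approximate matches played
print(matches)  # 15.0
```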

Part 5

List down all possible dimensionality reduction techniques that can be implemented using Python.

1. If a column has many missing values and is not important for predicting the target, it can be dropped instead of imputed.

For example, say we have 7 features and 1 target with 10,000 rows in total, and column1 is missing around 7,000 values. What I do is drop the NAs, use the remaining data to assess column1's importance in predicting the target, and also check whether the other features carry better predictive information. If column1 has very little predictive power, I would remove the column rather than impute it, since imputing with the mean or median would definitely reduce the variance in the data and completely change the model. It is better to remove a column when more than 50% of its data is missing and its predictive power is also low.

2. If I am working on a linear model and multiple columns contribute to the prediction, we will definitely face a multicollinearity problem. In that case we can drop some columns.

For example, with a dataset of 7 features and 1 continuous target where all 7 features correlate strongly (positively or negatively) with the target, I would check the variance inflation factor (VIF) of the columns and drop those with a high VIF, i.e. those that correlate most strongly with the other columns. In other words, the highly multicollinear columns are dropped and only a few are kept.

3. Principal Component Analysis (PCA): with this method we are free to choose the number of components to keep. We can create principal components and, depending on the amount of variance explained by each eigenvalue, pick a specific number of them for prediction. PCA reduces the multicollinearity problem, the curse of dimensionality, and the noise in the data, which makes it the best dimensionality reduction technique here.

4. Random Forest Classifier (RFC): this algorithm provides a feature-importance attribute. Once you fit the data to an RFC, you can read off the feature importances and drop the columns with the least importance. This is also a good dimensionality reduction technique, after PCA.

5. From searching online, I learned that Linear Discriminant Analysis (LDA) and Generalized Discriminant Analysis (GDA) are also dimensionality reduction techniques, but I have not yet understood the concepts.
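
Techniques 2-4 above (VIF, PCA component selection, RFC feature importances) can be sketched together on one synthetic dataset; the columns and target here are assumptions built so that x1/x2 are collinear and x3 is noise:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
x3 = rng.normal(size=200)                          # independent noise
X = np.column_stack([x1, x2, x3])

# (2) VIF_j = 1 / (1 - R^2_j), regressing column j on the remaining columns.
def vif(X, j):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(3)]
print([round(v, 1) for v in vifs])  # the collinear pair has very high VIF

# (3) PCA: keep enough components to explain 95% of the variance.
pca = PCA().fit(StandardScaler().fit_transform(X))
cum = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cum, 0.95) + 1)
print(n_keep)  # the collinear pair collapses into one component

# (4) RFC feature importances: the pure-noise column ranks last.
y = (x1 + x2 > 0).astype(int)
rfc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
order = np.argsort(rfc.feature_importances_)[::-1]
print(order)  # features at the end of this ranking are candidates to drop
```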

So far you have used dimensionality reduction on numeric data. Is it possible to do the same on multimedia data [images and video] and text data? Please illustrate your findings with a simple implementation in Python.

Since images are RGB data with multiple dimensions, they usually take more storage and are highly complex due to the high dimensionality.

We can use PCA to reduce the dimensions even further; the image may then look more pixelated, but it remains clear enough for us to identify the right digits.
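
A minimal image-PCA sketch, using scikit-learn's built-in 8x8 digits images as a stand-in for the image data discussed here (the 16-component choice is an assumption):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8x8 grayscale digit images, flattened to 64 pixel features each.
X = load_digits().data
print(X.shape)

# Keep 16 of the 64 pixel dimensions, then project back to pixel space.
pca = PCA(n_components=16).fit(X)
X_reduced = pca.transform(X)
X_restored = pca.inverse_transform(X_reduced)

print(X_reduced.shape)
print(round(float(pca.explained_variance_ratio_.sum()), 3))
```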

Since I do not have much experience with text, image or video preprocessing, I could not provide more information. I have used this data as shown in the mentor session and I understand the concept, but implementing it is harder since I have little experience with it.